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A simple growth model for DNA evolution is introduced which is analytically 
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solvable and reproduces the observed statistical behavior of real sequences. 
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The study of fractal structures in DNA chains is a subject of great interest nowadays. 
Two exciting reports of the existence of long-range correlations, which implies fractal struc- 
ture, in DNA sequences have been made recently by Li [1] and Peng et al. [2,3]. 

DNA sequences are formed of four different nucleotides: Adenine, Guanine, Thymine and 
Cytosine, which are represented by the letters A, G, T, and C, respectively. The first two are 
called purines and last ones pyrimidines. Roughly, half of the nucleotides in a chromosome 
are pyrimidines and half are purines. 

It is observed that the length of the DNA chain of evolved species, such as mammals, is 
much longer than the DNA of simple organisms, like bacteria [4]. Thus, DNA sequences are 
in a dynamical state, with its length increasing during evolution. Often the chain increases 
by gene duplication. Extra copies of the gene are stored in the same sequence as the original 
copy. Consequently, DNA sequences of evolved species usually have repetitive patterns 
[4]. However, the copy is not always identical to the original gene. During the process of 
duplication a mutation can occur, and more complex functions can appear in the evolving 
species. In our view, the reason for this is that it causes less disruption to normal functioning 
to keep a copy of the unmutated gene, because if the mutated gene does not work perfectly, 
the first copy can continue to function. Sometimes, the repeated segments are next to each 
other, sometimes they are separated. The length of these repeated segments varies greatly; 
some of them are short, others are as long as the entire gene. A set of genes descended by 
duplication and variation of some ancestral gene is called a gene family [4]. Copy numbers 
in a given family can vary from two to millions. 

It has been found that the content of repetitive DNA tends to increase with increasing 
chain length. Another important observation is that only a small fraction of the DNA chain 
is active, namely, codes for protein. The rest of it apparently has no function [4]. In humans 
the percentage of noncoding DNA can be larger than 90%. 

Peng et al. [2] analyzed the statistical properties of real DNA chains by constructing a 
graphical representation of the sequence. Starting from the origin, a walker moves either 
"up" [u{i) = if a pyrimidine (C or T) occurs at a position i or the walker steps "down" 
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[u[i) = —1] if a purine (A or G) occurs at position i. Next, they calculate the cumulative 
displacement y[l) of the walker after / steps, which is the sum of the unit steps u[i) at each 
position i. Thus, y[l) = 1^8=1 A fractal landscape is constructed by plotting y[l) 

versus /. 

The walk is characterized by the root mean square fluctuation F[l) about the average 
displacement; F[l) is defined in terms of the difference between the average of the square 

= — 2 

and the square of the average F'^{1) = [Aj/(/) — Aj/(/)]2 = [Aj/(/)]2 — Ay[l) , of a quantity 
Aj/(/) defined by Aj/(/) = j/(/o + /) — j/(/o). Here the bars indicate an average over all positions 
Iq in the sequence. 

Three possibilities exist for F[l): (i) if the walk is random, then F[l) ~ /^/^. (ii) If there 
are local correlations extending up to a characteristic range, then asymptotically F[l) ~ /^/^ 
would be unchanged from the purely random, (iii) If there is no characteristic length (i.e., 
if the correlation were "infinite-range"), then F[l) ~ /" with a ^ 1/2. 

In Refs. [2,5,6] was found that a > 0.5 for noncoding segments and a = 0.5 for coding 
segments. On the other hand, in Ref. [7] it was reported that in most cases log F[l) against 
log / is not a straight line. It was found nonlinear curves both for coding segments and those 
containing noncoding regions with a local slope larger than 0.5. According to these results, 
a well-defined fractal power exponent a does not seem to exist for DNA sequences. The 
behavior found has been that, for small /, a is approximately 0.5 then increases monotonically 
to a maximum value. Beyond this, the statistics have not been good enough to allow a 
conclusion about the asymptotic value of a in the limit of very large /. However, there are 
clear indications in [3] that a decreases after attaining a maximum, although this behavior 
was not investigated in detail for real sequences. 

We introduce here a simple iterative model for gene evolution, which is analytically 
solvable and reproduces the behavior of the exponent a found in [3,7] for real DNA sequences 
of intermediate size. The model incorporates basic features of DNA evolution, that is, 
sequence elongation due to gene duplication and mutations. When very large chains are 
analyzed in our model, the exponent a is back to the trivial value 0.5. Therefore, our model 
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suggests that the nontrivial value found for a in [2] is probably due to finite size effects. 
We are aware of two other models [f ,8] for DNA evolution, but in both of them the critical 
exponents do not go back to the trivial values of a random sequence as the chain gets very 
large. 

Our dynamical model for DNA evolution is the following: we start with a coding sequence 
which consists of g genes of equal length. Each gene has n nucleotides, which consists of 
pyrimidines (u = +f) and purines (u = — f) randomly distributed in a proportion of 50% 
each. In the simulated evolutionary process we choose at random a gene of our original 
sequence and in that gene we choose at random one of its nucleotides. Then, we change 
(mutate) this nucleotide from a pyrimidine to a purine or vice- versa. In real DNA a mutation 
in a single nucleotide is called a point mutation [4]. A copy of the chosen gene before the 
mutation is added at the end of the chain. This old gene becomes a noncoding segment, and 
it is not further modified. The genes that can mutate are always the first g genes, which are 
the coding DNA in our model. We iterate this process Ni times [Ni >> I). At the end, our 
chain will consist of a "head" of g coding genes and a big "tail" of Ni noncoding segments. 
The final length of the chain is L = n[g -\- Ni). 

It is clear that our model is a simplified version of what is found in real sequences. Some 
of the features that our model misses are the following: In real DNA, (a) the genes do not 
have the same size; (b) not all sites of a gene have identical probability of mutation; (c) the 
distribution of purines and pyrimidines does not have a random uniform distribution; (d) a 
point mutation can be from a purine (pyrimidine) to the other kind of purine (pyrimidine); 
(e) noncoding segments are interdispersed with coding ones, and (f) mutations can also 
occur via deletions and insertions of genetic material. Our preliminary results show that if 
we introduce in the model the features (a) to (e) the results shown here are changed only 
quantitatively (this will be reported elsewhere). We have not studied yet feature (f). 

In Fig. I we show a fractal landscape of our growth model, given by y[l) versus /, for the 
parameter values g = 3, n = 10 and Ni = 10"*. The landscape presents a jagged contour, as 
found in real DNA sequences [2], indicating the possible existence of long-range correlations. 
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The mean square fluctuation F[l) is related to the autocorrelation function C{1) = 

2 

u[lo)u[lo + /) — u[lo) through the relation F'^{1) = Y,i=i,i lZj=i,i C{j — i). In our model we 
can calculate analytically C{1) and consequently F[l) and a[l). Let us flrst analyze the 
term m(/o)m(/o + /) that appears in C{1). Since in our original random sequence of coding 
segments the nucleotides are not correlated to each other, this implies that correlations exist 
only for sites which are at a distance that is a multiple of n. When this is not the case, 
u[lo)u[lo + /) will be an average of L — l random terms with values ±1. Because of the central 
limit theorem, this term will be a gaussian random variable with zero mean and variance 
equal to 1 /V L — I. If we consider many realizations of the growth process, the ensemble 
average of this random variable will be zero. Consequently, in this limit m(/o)m(/o + /) = 
for / ^ mn (m = 1, 2, 3, ...) and 1^0. For / = one obviously has m(/o)m(/o) = 1. 

To flnd m(/o)m(/o + /) for the case in which / = mn will consider flrst the case of a 
single gene, = 1, with n > 1. When the gene is duplicated a mutation occurs and we 
have with probability pi = 1/n that u[lo + n) has a different sign as the one of m(/o), 
and with probability 1 — pi that their signs are the same. This is valid for any /q. Then 
A = u[lo)u[lo + n) = 1 — 2pi = [n — 2)/n, since u[lo) = ±1. We have the probability 
P2 = 2[n — for u[lo + 2n) being different from m(/o) and l—p2 for them being identical. 

Consequently, u[lo)u[lo + 2n) = [n — 2)^/n^ = A^. We can continue the process and easily 
flnd that for the case of a single gene u[lo)u[lo + mn) = A™. 

For g > I we have to consider that u[lo + n) has the probability 1/g of being originated 
from the same gene as m(/o). Consequently, it has probability pi = l/[ng) of being different 
from m(/o), and (n — l)/[ng) of being identical. Note that here the sum of these two 
probabilities is not one, since there exists also the probability [g — l)/g that u[lo + n) is 
originated from a different gene, and this gives on average a vanishing contribution for the 
term u[lo)u[lo + n). In this way, u[lo)u[lo + n) = A/g. Between u[lo + n) and u[lo + 2n) we 
see the following differences: there are g^ possible different combinations for the gene that 
contains u[lo + 2n). One of these combinations corresponds to a gene that is mutated twice 
and (1 — g) possibilities that a gene is mutated once. This will result in u[lo)u[lo + 2n) = 
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[A^ + [g — l)A)/g. It is not difficult to find that there are possible different combinations 
for the gene that contains u[lo + 3n). One of these possibilities corresponds to a gene that is 
mutated three times, 2[g — I) that a gene is mutated twice and [g — I)^ the gene is mutated 



once. Consequently, m(/o)m(/o + 3n) = A[A + g — 1]'^ /g^. In this way, we can continue the 
process and find that m(/o)m(/o + mn) = A[A + g — I]™~^/(/™. These results are valid for a 
large number of realizations of the growth process. However, note that the ensemble average 
is only important if the chain is not very large, because otherwise the spatial average will 
already give good statistical results. 



To determine m(/o) in the expression for C(/), we note that even if the number of 
pyrimidines is different from the number of purines in the initial chain of coding segments, 
after few iterations the probability of having u = —1 will be identical to the probability of 
having u = +1. The easiest way to show this is for the case of a single gene. Consider a 
nucleotide with m(/o) = +1. The probability of having u[lo + mn) = u[lo + (m — l)n) 
is = 2i^p™-i + 1(1 _ P"-i), with P° = 1. From this recursive relation we find 
P™ = (1 -(- ^ -(- ^2 _j_ _j_ The sum of this geometric series converges to 1/2. 

Therefore, for large m, the probability of finding a nucleotide with u[lo + mn) = +1, as 
it was in the beginning of the iteration process, is identical to the probability of finding 



u{lo + mn) = — 1. Since by definition u{l) = j- X]/o=i,i ^(^o) and u{lo) has probability 1/2 of 
being +1 and 1/2 of being —1, it turns out that J2io=i L ^i^o) evolves as a symmetric random 
walker in one-dimension. It is well known that such a walker diffuses away from the origin 



wi 



ith average distance of \/P- Consequently, u{l) l/\/P- 
Combining the results found above, we have the following expression for the autocorre- 
lation function 

'1-1/P, if/ = 0, 
C(/, L) = I A[A + g - l]"-!/^" -l/L, if / = mn, (1) 
— 1/P, otherwise. 

This shows that the autocorrelation function is a function of / and of the length L of the 
chain. The above equation does not apply when n = 1 and = 1, where C(/, L) is a period- 
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two function. Also, for the case n = 1 and > 1 all sites are uncorrelated to each other, 



which results in u[lo)u{lo + /) = when 1^0. Finally, the case n = 2 is not interesting since 
it results in A = 0. Therefore, we will concentrate our attention to cases in which n > 2. 

From Fq. (1) we can obtain the correlation length /, which is determined by the smallest 
value of /, with / = mn, satisfying C(/, L) = 0. Namely, / = mn, where m is given by 



m = log 



A + g-1 



'log 



A + g-1 



LA 

Since F[l^L) can be obtained from C{l,L), and if we take into consideration that 
C{-l,L) = C{l,L), we find 



F(/,Z)= //[1 + 2 ^ C(j,Z)]-2 ^ jC(j,Z). (3) 

V j=i,' j=i,' 

As in [3], we calculate the local slope L) of log^^Q L) versus log^^Q Z^, given by 

a{U,L} = . (4) 

logio h+i - logio k 

If we take li^i = U + A;, then 



^ [l + 2 L)] + 2 Ea+i(/. - L) 

(5) 



For I < ng OT large /, we have C(/, Z) of the order of — l/Z, this implies that for very large 
chains Fq. (5) g 0.5. For intermediate values of / we have that a presents a spiked 

behavior at / = mn, as a consequence of the spikes present in the correlation function (see 
Fq. 1). We observe that if we average L) for n consecutive values of /, with / > ng^ the 
curve a versus log / becomes smooth. Another equivalent way to smooth L) is to take 
k = n and li a multiple of n/2. This is the method we use here. 

We show in Fig. 2(a) log^Q F[l^L) versus log^^Q / for n = 10, g = 3 (solid), n = 10, 
g = 10 (dashed) and n = 6, (/ = 3 (dotted). In all the three cases, Ni = 10"* . As in real 
DNA sequences [3,7] we do not find a straight line. Therefore, also in our growth model 
a well-defined fractal power exponent a does not exist. In Fig. 2(b) we show a versus 
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log / associated with the curves shown in Fig. 2(a). We see that the local slope a starts 
from 1/2, then increases to a maximum value, and for large / decreases again to 1/2. For 
these parameters values we find from Fq. (2) log^^Q / 3.17, 3.64, 2.69, respectively, which 
correctly mark the region in which a[l^L) returns to the trivial value 1/2. We see that, for 
small and large / the exponent a is the one of a random sequence. Our results for n = 10, 
g = 10 and n = 10, = 3 are in good agreement with Figs. 4 and 8 of [3]. However, in those 
figures the statistics for large / are not as good as in our case, and a presents considerable 
fiuctuations. The results obtained from our analytical expressions were compared with 
the ones obtained by direct simulation of the growth process, and they were in excellent 
agreement with each other. Our picture indicates that for large /, L) = 1/2 in real DNA 
sequences if chains sufficiently big were analyzed in [3] . 

We also analyzed the Fourier spectrum of the numerical sequence obtained through the 
random walk method described above. We see at large frequencies a white noise component 
(with peaks located at / = 1/n and its respective harmonics, refiecting the presence of the 
spikes at / = mn in the autocorrelation function). For intermediate frequencies we find 1/ f 
noise and at small frequencies the power spectrum again fiattens off. This can be observed 
only if the chain is very large. If this is not the case, only the 1// and the white noise 
regions (at large frequencies) are observed (more details will be published elsewhere). This 
is exactly what has been seen in real DNA sequences [1,9]. By studying very large sequences, 
we are able to show that the 1// noise in real DNA chains is apparently only a transient 
behavior observed in chains that are not sufficiently large. 

In conclusion, we have introduced a growth model for DNA evolution which incorporates 
basic features of a DNA growth process, that is, sequence elongation due to gene duplication 
and mutations. Our model gives the same statistical results of real sequences investigated 
in the literature, that is, the apparent existence of long-range correlations. However, when 
larger chains are studied, we see that this is only a transient behavior valid for chains not 
sufficiently big. From our studies we suggest that a possible way to find the asymptotic 
value of a in real sequences, eliminating the statistical fiuctuations, is to divide the chain in 
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big segments and study the ensemble average of this exponent. 
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FIGURES 

FIG. 1. Fandscape (walk) representation of our model, with n = 10, g = 3 and Ni = 10'*. As 
in [7], for a more convenient representation, we removed the trend of the landscape in such a way 
that the end point has the same vertical displacement as the starting point. 

FIG. 2. (a) logiQ F(l,L) versus log;^Q/. (b) a(l,L) versus log;^Q/, for n = 10, g = 3 (solid), 
n = 10, ^ = 10 (dashed) and n = 6, g = 3 (dotted) with Ni = 10"*. 
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